January 2019
Probability is the expression of belief in some future outcome
A random variable can take on different values with different probabilities
The sample space of a random variable is the universe of all possible values
The sample space of a random variable can be represented by a probability distribution
A Bernoulli trial describes the outcome of a single event with success probability p
Example: flipping a fair coin once
\[Pr(X=\text{Head}) = \frac{1}{2} = 0.5 = p \]
\[Pr(X=\text{Tails}) = \frac{1}{2} = 0.5 = 1 - p \]
\[ p + (1-p) = 1 \]
\[ Pr(X=\text{H and }Y=\text{H}) = p \cdot p = p^2 \] \[ Pr(X=\text{H and }Y=\text{T}) = p(1-p) \] \[ Pr(X=\text{T and }Y=\text{H}) = (1-p)p \] \[ Pr(X=\text{T and }Y=\text{T}) = (1-p)^2 \]
H and T can occur in either order \[ \text{Pr(one H and one T)} = \] \[ \text{Pr(X=H and Y=T) or Pr(X=T and Y=H)} = \] \[ p(1-p) + (1-p)p = 2p(1-p) \]
For a fair coin, p = 1 - p = 0.5, so each ordered outcome has probability 0.25 and Pr(one H and one T) = 0.5
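A quick simulation sketch (assuming a fair coin, so p = 0.5) to check the probability of one head and one tail in either order:

```r
# Estimate Pr(one head and one tail) over two flips of a fair coin,
# then compare with the analytic value.
set.seed(1)
p <- 0.5
n_trials <- 100000
flip1 <- rbinom(n_trials, size = 1, prob = p)  # 1 = heads, 0 = tails
flip2 <- rbinom(n_trials, size = 1, prob = p)
one_of_each <- mean(flip1 != flip2)  # H then T, or T then H
one_of_each          # close to 0.5
2 * p * (1 - p)      # analytic value: 0.5 for a fair coin
```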
Joint probability
If X and Y are independent, the joint probability is the product of the individual probabilities \[Pr(X,Y) = Pr(X) \cdot Pr(Y)\]
Conditional probability
If X and Y are independent: \[Pr(Y|X) = Pr(Y)\text{ and }Pr(X|Y) = Pr(X)\]
If X and Y are dependent: \[Pr(Y|X) \neq Pr(Y)\text{ and }Pr(X|Y) \neq Pr(X)\]
A binomial distribution results from the combination of several independent Bernoulli events
The distribution of probabilities for each combination of outcomes is
\[\large f(k) = {n \choose k} p^{k} (1-p)^{n-k}\]
n is the total number of trials
k is the number of successes
p is the probability of success
q = 1 - p is the probability of failure
\[\large p^{k} (1-p)^{n-k}\]
This part is the probability of any single sequence of k successes and n - k failures
\[\large {n \choose k}\]
This part (called the binomial coefficient) is the number of different orders in which the k successes can occur
Together they equal the probability of a specified number of successes
\[\large f(k) = {n \choose k} p^{k} (1-p)^{n-k}\]
Use the rbinom function (one of the *binom family) to replicate what you sketched out for coin flipping. Another common situation in biology is when each trial is discrete but only the number of occurrences of each outcome is observed.
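A sketch of the rbinom() simulation, compared against the binomial mass function f(k):

```r
# 1000 draws of the number of heads in 10 flips of a fair coin,
# compared against the exact binomial probabilities.
set.seed(1)
heads <- rbinom(1000, size = 10, prob = 0.5)
table(heads) / 1000                   # simulated proportions of each k
dbinom(0:10, size = 10, prob = 0.5)   # exact f(k) for k = 0..10
```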
Pr(Y=r) is the probability that the number of occurrences of an event Y equals a count r in the total number of trials \[Pr(Y=r) = \frac{e^{-\mu}\mu^r}{r!}\] where \(\mu\) (often written \(\lambda\)) is the mean number of occurrences
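A sketch checking the Poisson mass function by hand against dpois() (the mean mu = 3 is an arbitrary choice for illustration):

```r
# Poisson probabilities computed from the formula vs. the built-in.
mu <- 3
r <- 0:10
by_hand <- exp(-mu) * mu^r / factorial(r)  # e^(-mu) * mu^r / r!
built_in <- dpois(r, lambda = mu)
all.equal(by_hand, built_in)  # TRUE
```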
The normal probability density is \[\large f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]
where \[\large \pi \approx 3.14159\]
\[\large e \approx 2.71828\]
To write that a variable (v) is distributed as a normal distribution with mean \(\mu\) and variance \(\sigma^2\), we write the following:
\[\large v \sim \mathcal{N} (\mu,\sigma^2)\]
Estimate of the mean from a single sample
\[\Large \bar{x} = \frac{1}{n}\sum_{i=1}^{n}{x_i} \]
Estimate of the variance from a single sample
\[\Large s^2 = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i - \bar{x})^2} \]
The standard deviation is the square root of the variance
\[\Large s = \sqrt{s^2} \]
\[\huge z_i = \frac{(x_i - \bar{x})}{s}\]
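As a sketch, the estimators above computed by hand against R's built-ins (the data vector is made up for illustration):

```r
# Sample mean, variance, sd, and z-scores from the formulas above.
x <- c(4.1, 5.0, 5.9, 6.2, 7.3)
n <- length(x)
x_bar <- sum(x) / n                  # sample mean
s2 <- sum((x - x_bar)^2) / (n - 1)   # sample variance (n - 1 denominator)
s <- sqrt(s2)                        # standard deviation
z <- (x - x_bar) / s                 # z-scores
all.equal(x_bar, mean(x))  # TRUE
all.equal(s2, var(x))      # TRUE
all.equal(s, sd(x))        # TRUE
round(mean(z), 10)         # 0: z-scores are centered on zero
```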
Simulate a population of 10,000 individual values for a variable x:
x <- rnorm(10000, mean=50.5, sd=5.5)
Take 1000 random samples of size 20, take the mean of each sample, and plot the distribution of these 1000 sample means.
x_sample_means <- NULL
for(i in 1:1000){
  x_samp <- sample(x, 20, replace=FALSE)  # one random sample of size 20
  x_sample_means[i] <- mean(x_samp)       # store its mean
}
hist(x_sample_means)  # distribution of the 1000 sample means
For one of your samples, use the equation from the previous slide to transform the values to z-scores.
Plot the distribution of the z-scores, and calculate the mean and the standard deviation.
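A sketch of this step, self-contained so it runs on its own (the population is regenerated as in the earlier slide):

```r
# z-transform one sample of 20 and summarize the result.
set.seed(1)
x <- rnorm(10000, mean = 50.5, sd = 5.5)
x_samp <- sample(x, 20, replace = FALSE)
z <- (x_samp - mean(x_samp)) / sd(x_samp)
hist(z, main = "z-scores of one sample")
round(mean(z), 10)  # 0 by construction
sd(z)               # 1 by construction
```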
Mean of x in each case 9 (exact)
Variance of x in each case 11 (exact)
Mean of y in each case 7.50 (to 2 decimal places)
Variance of y in each case 4.122 or 4.127 (to 3 decimal places)
Correlation between x and y in each case 0.816 (to 3 decimal places)
Linear regression line in each case \(y =3.00 + 0.50x\) (to 2 decimal places)
\[\huge \sigma_{\bar{x}} \approx s_{\bar{x}} = \frac{s}{\sqrt{n}} \]
Sadly, most other kinds of estimates do not have this amazing property.
What to do?
One answer: make your own sampling distribution for the estimate using the “bootstrap”.
Method invented by Efron (1979).
Use R to take a random sample of individuals from the original data. Examine the data_frames.R script.
Examine the simple_boostrap.R script
On your own - use R to figure out the bootstrap distribution for other parameters (such as variance).
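A minimal bootstrap sketch, using made-up normal data (the sample and replicate counts are illustrative, not from the course scripts):

```r
# Resample with replacement, recompute the statistic each time;
# the spread of those values approximates its standard error.
set.seed(1)
x <- rnorm(100, mean = 50, sd = 5)
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))
sd(boot_means)     # bootstrap SE of the mean
sd(x) / sqrt(100)  # analytic SE for comparison
boot_vars <- replicate(2000, var(sample(x, replace = TRUE)))
sd(boot_vars)      # bootstrap SE of the variance (no simple formula exists)
```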
x <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 1.6, 2.0, 2.0)
Compare your bootstrap standard error to the analytic estimate from the sd() function: sd(x)/sqrt(10)
Read in the GacuRNAseq_Subset.csv data and obtain a bootstrapped estimate of the mean expression level. Use the quantile() function to calculate confidence intervals.
\(H_0\) : Null hypothesis : Ponderosa pine trees are the same height on average as Douglas fir trees
\(H_A\) : Alternative Hypothesis: Ponderosa pine trees are not the same height on average as Douglas fir trees
What is the probability that we would reject a true null hypothesis?
What is the probability that we would accept a false null hypothesis?
How do we decide when to reject a null hypothesis and support an alternative?
What can we conclude if we fail to reject a null hypothesis?
What parameter estimates of distributions are important to test hypotheses?
\(H_0\) : Null hypothesis : Ponderosa pine trees are the same height on average as Douglas fir trees
\[H_0 : \mu_1 = \mu_2\]
\(H_A\) : Alternative Hypothesis: Ponderosa pine trees are not the same height on average as Douglas fir trees
\[H_A : \mu_1 \neq \mu_2\]
\[\huge t = \frac{(\bar{y}_1-\bar{y}_2)}{s_{\bar{y}_1-\bar{y}_2}} \]
where \[\large s_{\bar{y}_1-\bar{y}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]
which is the calculation for the standard error of the mean difference
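A sketch assuming the unequal-variance (Welch) form of the standard error, which is what t.test() uses by default; the data are simulated stand-ins for the tree heights:

```r
# The t statistic from the formula above, compared with t.test().
set.seed(1)
y1 <- rnorm(30, mean = 20, sd = 3)  # e.g. pine heights
y2 <- rnorm(30, mean = 22, sd = 3)  # e.g. fir heights
se_diff <- sqrt(var(y1)/length(y1) + var(y2)/length(y2))
t_by_hand <- (mean(y1) - mean(y2)) / se_diff
t_by_hand
t.test(y1, y2)$statistic  # same value
```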
Make a dummy data set that contains one continuous and one categorical value (with two levels). Draw individuals of the two different factor levels from normal distributions with slightly different means. Take 100 obsv. per factor level.
Now, perform a t-test to see if the two populations are statistically different from one another
boxplot(continuous_variable ~ cat_variable, data = dataset_name)
t.test(continuous_variable ~ cat_variable, data = dataset_name)
Repeat steps 1 and 2 above but use different sample sizes (e.g. 10 , then 100, then 1000, then 10,000 obsv. per factor level)
Repeat steps 1 and 2 above but now test between a categorical and continuous variable in the perchlorate data set (or any of your favorite data sets)
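One possible way to build the dummy data set for steps 1 and 2 (the group names and distribution parameters here are arbitrary choices):

```r
# Two groups of 100 drawn from normals with slightly different means.
set.seed(1)
dummy <- data.frame(
  cat_variable = rep(c("A", "B"), each = 100),
  continuous_variable = c(rnorm(100, mean = 10, sd = 2),
                          rnorm(100, mean = 11, sd = 2))
)
boxplot(continuous_variable ~ cat_variable, data = dummy)
t.test(continuous_variable ~ cat_variable, data = dummy)
```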
Mean of x in each case 9 (exact)
Variance of x in each case 11 (exact)
Mean of y in each case 7.50 (to 2 decimal places)
Variance of y in each case 4.122 or 4.127 (to 3 decimal places)
Correlation between x and y in each case 0.816 (to 3 decimal places)
Linear regression line in each case \[ y = 3.00 + 0.50x\]
General Linear Model (GLM) - two or more continuous variables
General Linear Mixed Model (GLMM) - a continuous response variable with a mix of continuous and categorical predictor variables
Generalized Linear Model - a linear model that doesn’t assume normality of the response (we’ll get to this later)
Generalized Additive Model (GAM) - a model that doesn’t assume linearity (we won’t get to this)
All can be written in the form
response variable = intercept + (explanatory_variables) + random_error
in the general form:
\[ Y=\beta_0 +\beta_1*X_1 + \beta_2*X_2 +... + error\]
where \(\beta_0, \beta_1, \beta_2, ....\) are the parameters of the linear model
All of these will include the intercept
Y~X Y~1+X Y~X+1
All of these will exclude the intercept
Y~-1+X Y~X-1
Need to fit the model and then 'read' the output
trial_lm <- lm(Y ~ X)
summary(trial_lm)
Write a script to read in the perchlorate data set. Now, add the code to perform a linear model of two continuous variables. Notice how the output of the linear model is assigned to a new variable. Also note that the variables and dataset are placeholders.
my_lm <- lm(dataset$variable1 ~ dataset$variable2)
Now look at a summary of the linear model
summary(my_lm)
print(my_lm)
Now let's produce a nice regression plot
plot(dataset$variable1 ~ dataset$variable2, col = "blue")
abline(my_lm, col = "red")
Notice that you are adding the fitted line from your linear model Finally, remake this plot in GGPlot2
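A sketch of the ggplot2 version; `dataset`, `variable1`, and `variable2` are the same placeholders as above, filled here with toy data so the block runs on its own:

```r
# ggplot2 version of the base-R regression plot.
library(ggplot2)
set.seed(1)
dataset <- data.frame(variable2 = 1:20,
                      variable1 = 2 + 0.5 * (1:20) + rnorm(20))
p <- ggplot(dataset, aes(x = variable2, y = variable1)) +
  geom_point(colour = "blue") +                           # the scatter
  geom_smooth(method = "lm", se = FALSE, colour = "red")  # fitted line
p
```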
SL - standard length of the fish
Testes Stage
Perchlorate_Data <- read.table('perchlorate_data.tsv', header=T, sep='\t')
head(Perchlorate_Data)
x <- Perchlorate_Data$T4_Hormone_Level
y <- Perchlorate_Data$Testes_Area
perc_lm <- lm(y ~ x)
summary(perc_lm)
plot(y ~ x, col = "blue")
abline(perc_lm, col = "red")
\[H_0 : \beta_0 = 0\] \[H_0 : \beta_1 = 0\]
full model - \(y_i = \beta_0 + \beta_1*x_i + error_i\)
reduced model - \(y_i = \beta_0 + 0*x_i + error_i\)
Estimation of the variation that is explained by the model (SS_model)
SS_model = SS_total(reduced model) - SS_residual(full model)
The variation that is unexplained by the model (SS_residual)
SS_residual(full model)
\[r^2 = SS_{regression}/SS_{total} = 1 - (SS_{residual}/SS_{total})\] or \[r^2 = 1 - (SS_{residual(full)}/SS_{total(reduced)})\] Which is the proportion of the variance in Y that is explained by X
\[\beta_{YX}=\rho_{YX}*\sigma_Y/\sigma_X\] \[b_{YX} = r_{YX}*S_Y/S_X\]
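A quick numerical check of the sample version \(b_{YX} = r_{YX} s_Y / s_X\) against the slope from lm(), using simulated data:

```r
# The regression slope recovered from the correlation coefficient.
set.seed(1)
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50, sd = 0.3)
b_from_r <- cor(x, y) * sd(y) / sd(x)
b_from_lm <- unname(coef(lm(y ~ x))["x"])
all.equal(b_from_r, b_from_lm)  # TRUE: they are algebraically identical
```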
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
What if your residuals aren’t normal because of outliers?
Nonparametric methods exist, but these don’t provide parameter estimates with CIs.
Robust regression (rlm)
Randomization tests
Leverage - a measure of how much of an outlier each point is in x-space (on x-axis) and thus only applies to the predictor variable. (Values > 2*(2/n) for simple regression are cause for concern)
Residuals - As the residuals are the differences between the observed and predicted values along a vertical plane, they provide a measure of how much of an outlier each point is in y-space (on y-axis). The patterns of residuals against predicted y values (residual plot) are also useful diagnostic tools for investigating linearity and homogeneity of variance assumptions
Cook’s D statistic is a measure of the influence of each point on the fitted model (estimated slope) and incorporates both leverage and residuals. Values ≥ 1 (or even approaching 1) correspond to highly influential observations.
Let’s look at the zebrafish diet data.
x <- zfish_diet$Protein
y <- zfish_diet$Weight
zfish_lm <- lm(y ~ x)
summary(zfish_lm)
plot(y ~ x, col = "blue")
abline(zfish_lm, col = "red")
hist(residuals(zfish_lm), breaks=30)
plot(residuals(zfish_lm) ~ fitted.values(zfish_lm))
plot(residuals(zfish_lm) ~ x)
Or apply the plot() function to the linear model object directly
plot(zfish_lm)
Figure out what these plots are telling you
How about using the influence.measures function???
influence.measures(zfish_lm)
Do we have any high leverage observations we need to worry about??? Now go back and try to recreate these graphs in GGPlot2
x <- zfish$SL
y <- zfish$Weight
size_lm <- lm(y ~ x)
summary(size_lm)
plot(y ~ x, col = "blue")
abline(size_lm, col = "red")
Note that when we run the lm() function we use Model I regression. Is this appropriate?
install.packages("lmodel2")
library(lmodel2)
size_lm_ModII <- lmodel2(y ~ x)
size_lm_ModII
abline(a=0.1912797, b=0.1097880, lty=2)
You should have generated OLS, MA, and SMA regression models; the last line plots the SMA line from its parameter estimates.
Say you measured green fluorescent protein abundance in cell culture across several key, carefully controlled temperatures.
You ultimately want to know whether protein expression changes as a function of temperature in your experiment.
Read in the gfp_temp.tsv data and perform a regression analysis.
What is our decision regarding our null hypothesis of no relationship?
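A sketch of the workflow with simulated stand-in data, since the column names in gfp_temp.tsv are not shown here; swap in the real file and its actual column names:

```r
# Simulated stand-in for gfp_temp.tsv; the column names (temperature,
# gfp) are assumptions, not taken from the real file.
set.seed(1)
gfp_data <- data.frame(temperature = rep(c(18, 22, 26, 30), each = 10))
gfp_data$gfp <- 5 + 0.3 * gfp_data$temperature + rnorm(40)
gfp_lm <- lm(gfp ~ temperature, data = gfp_data)
summary(gfp_lm)  # the slope's p-value addresses H0: no relationship
```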
The function of the Results section is to objectively present your key results, without interpretation, in an orderly and logical sequence using both text and illustrative materials (Tables and Figures).
The results section always begins with text, reporting the key results and referring to figures and tables as you proceed.
The text of the Results section should be crafted to follow this sequence and highlight the evidence needed to answer the questions/hypotheses you investigated.
Important negative results should be reported, too. Authors usually write the text of the results section based upon the sequence of Tables and Figures.
Results may appear either in the text (usually parenthetically) or in the relevant Tables or Figures (in the legend or as footnotes to the Table or Figure). Each Table and Figure must be referenced in the text portion of the results, and you must tell the reader what the key result(s) is that each Table or Figure conveys.
For example, suppose you asked the question, "Is the average height of male students the same as female students in a pool of randomly selected Biology majors?" You would first collect height data from large random samples of male and female students. You would then calculate the descriptive statistics for those samples (mean, SD, n, range, etc) and plot these numbers. Suppose you found that male Biology majors are, on average, 12.5 cm taller than female majors; this is the answer to the question. Notice that the outcome of a statistical analysis is not a key result, but rather an analytical tool that helps us understand what is our key result.
Report your results so as to provide as much information as possible to the reader about the nature of differences or relationships.
For example, if you are testing for differences among groups, and you find a significant difference, it is not sufficient to simply report that "groups A and B were significantly different". How are they different? How much are they different?
It is much more informative to say something like, "Group A individuals were 23% larger than those in Group B", or, "Group B pups gained weight at twice the rate of Group A pups."
Report the direction of differences (greater, larger, smaller, etc) and the magnitude of differences (% difference, how many times, etc.) whenever possible.
Statistical test summaries (test name, p-value) are usually reported parenthetically in conjunction with the biological results they support. This parenthetical reference should include the statistical test used, the test statistic value, degrees of freedom, and the level of significance.
For example, if you found that the mean height of male Biology majors was significantly larger than that of female Biology majors, you might report this result together with your statistical conclusion as follows:
If the summary statistics are shown in a figure, the sentence above need not report them specifically, but must include a reference to the figure where they may be seen:
To develop a better predictive model than is possible from models based on single independent variables.
To investigate the relative individual effects of each of the multiple independent variables above and beyond the effects of the other variables.
The individual effects of each of the predictor variables on the response variable can be depicted by single partial regression lines.
The slope of any single partial regression line (partial regression slope) thereby represents the rate of change or effect of that specific predictor variable (holding all the other predictor variables constant to their respective mean values) on the response variable.
Additive model \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_j x_{ij} + \epsilon_i\]
Multiplicative model (with two predictors) \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \epsilon_i\]
RNAseq_Data <- read.table('RNAseq_lip.tsv', header=T, sep='\t')
y <- RNAseq_Data$Lipid_Conc
g1 <- RNAseq_Data$Gene01
g2 <- RNAseq_Data$Gene02
g3 <- RNAseq_Data$Gene03
g4 <- RNAseq_Data$Gene04
Mult_lm <- lm(y ~ g1*g2)
summary(Mult_lm)
Add_lm <- lm(y ~ g1+g2)
summary(Add_lm)
RNAseq_Data <- read.table('RNAseq_lip.tsv', header=T, sep='\t')
RNAseq_lm_linear <- lm(y ~ g4)
summary(RNAseq_lm_linear)
influence.measures(RNAseq_lm_linear)
RNAseq_lm_poly <- lm(y ~ poly(g4, 2))
summary(RNAseq_lm_poly)
plot(y ~ g4)
ord <- order(g4)
lines(g4[ord], fitted(RNAseq_lm_poly)[ord], col="blue")
influence.measures(RNAseq_lm_poly)
library(car)
scatterplotMatrix(~var1+var2+var3, diag="boxplot")
RNAseq3_lm <- lm(y ~ g1+g2+g3)
summary(RNAseq3_lm)
plot(RNAseq3_lm)
library(car)
scatterplotMatrix(~g1+g2+g3, diag="boxplot")
scatterplotMatrix(~y+g1+g2+g3, diag="boxplot")
To get tolerance (1-\(R^2\)) calculations
1/vif(lm(y ~ g1 + g2 + g3))
How to decide the complexity of polynomial: straight line regression, quadratic, cubic, ….
Which variables to keep/ discard when building a multiple regression model?
Selecting from candidate models representing different biological processes.
"models should be pared down until they are minimal and adequate”
Crawley 2007, p325
The model should predict well
It should approximate the true relationship between the variables
We should be able to evaluate a wide array of models, not only full versus “reduced” models
NOTE: Reduced vs. full models are referred to as nested models. Non-subset models are called non-nested models.
Don’t confuse with nested experimental designs or sampling designs.
How do we accomplish these goals? To answer this, we need criteria for comparing candidate models.
MS_residuals - represents the mean amount of variation unexplained by the model; the lowest value indicates the best fit.
Adjusted \(r^2\) - the proportion of variation in the response variable explained by the model, adjusted for both sample size and the number of terms. Larger values indicate better fit.
Mallow’s Cp - an index resulting from the comparison of the specific model to a model that contains all the possible terms. Models with the lowest value and/or values closest to their respective p (the number of model terms, including the y-intercept) indicate best fit.
Akaike Information Criteria (AIC) - there are several different versions of AIC, each of which adds a different constant designed to penalize according to the number of parameters and sample size to a likelihood function to produce a relative measure of the information content of a model. Smaller values indicate more parsimonious models. As a rule of thumb, if the difference between two AIC values (delta AIC) is greater than 2, the lower AIC is a significant improvement in parsimony.
Schwarz or Bayesian Information Criteria (BIC or SIC) - is outwardly similar to AIC. The constant added to the likelihood function penalizes models with more predictor terms more heavily (and thus select more simple models) than AIC. It is for this reason that BIC is favored by many researchers.
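A minimal sketch of comparing nested models with AIC() and BIC() in base R (simulated data; x2 is irrelevant by construction):

```r
# Compare a one-predictor model with a two-predictor model.
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 1 + 2 * x1 + rnorm(100)  # y does not depend on x2
m1 <- lm(y ~ x1)
m2 <- lm(y ~ x1 + x2)
AIC(m1, m2)  # smaller is better; a delta > 2 is a meaningful difference
BIC(m1, m2)  # penalizes the extra term more heavily than AIC
```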
The leaps package in R performs a smart search among a potentially huge number of models. For prediction: all models with Cp < p predict about equally well. Don’t get carried away with finding a single “best” model.
For explanation: If numerous equally well fitting models fit the data, it is difficult to deduce which predictor “explains” the response.
General caveat: “regression is not causation”. Experiment needed to get to causal explanation.
From Langford, D. J.,et al. 2006. Science 312: 1967-1970
In words:
stretching = intercept + treatment
- The model statement includes a response variable, a constant, and an explanatory variable.
- The only difference with regression is that here the explanatory variable is categorical.
RNAseq_Data <- read.table('RNAseq_lip.tsv', header=T, sep='\t')
g1 <- RNAseq_Data$Gene01
Pop <- RNAseq_Data$Population
boxplot(g1~Pop, col=c("blue","green"))
Or, to plot all points:
stripchart(g1~Pop, vertical=T, pch=19, col=c("blue","green"),
at=c(1.25,1.75), method="jitter", jitter=0.05)
Pop_Anova <- aov(g1 ~ Pop)
summary(Pop_Anova)
In R, lm assumes that all effects are fixed. For models with random effects, use lme instead (part of the nlme package). The dplyr functions may also be useful here.
RNAseq_Data <- read.table("RNAseq.tsv", header=T, sep='')
x <- RNAseq_Data$categorical_var
y <- RNAseq_Data$continuous_var1
z <- RNAseq_Data$continuous_var2
contrasts(x) <- cbind(c(0, 1, 0, -1), c(2, -1, 0, -1), c(-1, -1, 3, -1))
round(crossprod(contrasts(x)), 2)
rnaseq_data_list <- list(x = list('xxx vs. xxx' = 1, 'xxx vs. xxx' = 2, 'xxx vs. xxx' = 3))
RNAseq_aov_fixed <- aov(y ~ x)
plot(RNAseq_aov_fixed)
boxplot(y ~ x)
summary(RNAseq_aov_fixed, split = rnaseq_data_list)
perc <- read.table('perchlorate_data.tsv', header=T, sep='\t')
x <- perc$Perchlorate_Level
y <- log10(perc$T4_Hormone_Level)
MyANOVA <- aov(y ~ x)
summary(MyANOVA)
boxplot(y ~ x)
install.packages("multcomp")
library(multcomp)
summary(glht(MyANOVA, linfct = mcp(x = "Tukey")))
Survival of climbers of Mount Everest is higher for individuals taking supplemental oxygen than those who don’t.
Why?
The goal of experimental design is to eliminate bias and to reduce sampling error when estimating and testing effects of one variable on another.
\[ Power \propto \frac{(ES)(\alpha)(\sqrt n)}{\sigma}\]
Power is proportional to the combination of these parameters
perc <- read.table('perchlorate_data.tsv', header=T, sep='\t')
x <- perc$Strain
y <- log10(perc$T4_Hormone_Level)
MyANOVA <- aov(y ~ x)
summary(MyANOVA)
boxplot(y ~ x)
Based on your results, calculate the power for your ANOVA.
pwr.t2n.test(n1=xxx, n2=xxx, d=xxx, sig.level=.05, power=NULL)
Check out the functions in the 'pwr' library (unnecessary in this case, but you could use the ANOVA version):
pwr.anova.test(k=2, n=190, f=0.25, sig.level=.05, power=NULL)
Each pair of dots represents the two measurements
andrew_data <- read.table('andrew.tsv', header=T, sep='\t')
head(andrew_data)
andrew_data$TREAT2 <- factor(c(rep("low",40), rep("high",40)))
andrew_data$PATCH <- factor(andrew_data$PATCH)
andrew.agg <- with(andrew_data, aggregate(data.frame(ALGAE),
  by = list(TREAT2=TREAT2, PATCH=PATCH), mean))
library(nlme)
andrew.agg <- gsummary(andrew_data, groups=andrew_data$PATCH)
boxplot(ALGAE ~ TREAT2, andrew.agg)
nested.aov <- aov(ALGAE ~ TREAT2 + Error(PATCH), data=andrew_data)
summary(nested.aov)
library(nlme)
VarCorr(lme(ALGAE ~ 1, random = ~1 | TREAT2/PATCH, andrew_data))
rnadata <- read.table('RNAseq.tsv', header=T, sep='')
head(rnadata)
gene <- rnadata$Gene80
microbiota <- rnadata$Microbiota
genotype <- rnadata$Genotype
boxplot(gene ~ microbiota)
boxplot(gene ~ genotype)
boxplot(gene ~ microbiota*genotype)
rna_aov <- aov(gene ~ microbiota + genotype + microbiota:genotype)
rna_aov <- aov(gene ~ microbiota*genotype)
plot(rna_aov)
summary(rna_aov)
anova(rna_aov)
Examine the IntPlot_data file and the IntPlot_Example.R script, which show interaction plots for a continuous response and two main effect variables.
rnadata <- read.table('RNAseq.tsv', header=T, sep='')
gene <- rnadata$Gene80
microbiota <- rnadata$Microbiota
genotype <- rnadata$Genotype
make new “pseudo factor,” combining genotype and microbiota
gxm <- interaction(genotype, microbiota)
levels(gxm)
boxplot(gene ~ gxm)
specify the following 2 contrasts
contrasts(gxm) <- cbind(c(2, -1, 0, -1), c(-1, -1, 3, -1))
Fit the factorial linear model
rna_aov <- aov(gene ~ gxm)
Examine the ANOVA table, using supplied contrasts. Figure out the appropriate titles to give them.
summary(rna_aov, split = list(gxm = list('xxx'=1,'xxx'=2)))
What does the contrast summary tell you about the nature of the interaction?
For mixed models, use the nlme & lme4 packages in R.
rnadata <- read.table('RNAseq.tsv', header=T, sep='')
head(rnadata)
Extract the variables, excluding the first 5 and last 5 observations
gene <- rnadata$Gene80[6:75]
microbiota <- rnadata$Microbiota[6:75]
genotype <- rnadata$Genotype[6:75]
boxplot(gene ~ microbiota)
boxplot(gene ~ genotype)
boxplot(gene ~ microbiota*genotype)
Estimate the variance components using Restricted Maximum Likelihood (REML)
library(lme4)
lmer(gene ~ 1 + (1 | microbiota) + (1 | genotype) + (1 | microbiota:genotype))
Based on the REML sd estimates, what are the relative contributions of the factors to total variance in gene expression?
longevity_data <- read.table('longevity.csv', header=T, sep=',')
head(longevity_data)
Variables
long <- longevity_data$LONGEVITY
treat <- longevity_data$TREATMENT
thorax <- longevity_data$THORAX
boxplot(long ~ treat)
plot(long ~ thorax)
plot(aov(long ~ thorax + treat ), which = 1)
plot(aov(log10(long) ~ thorax + treat ), which = 1)
library(lattice)
print(xyplot(log10(long) ~ thorax | treat, type = c("r", "p")))
anova(aov(log10(long) ~ thorax*treat))
anova(aov(thorax ~ treat))
"Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space"
— Edward Tufte
Draw graphical elements clearly, minimizing clutter
Represent magnitudes honestly and accurately
“Graphical excellence begins with telling the truth about the data” – Tufte 1983
“Graphical excellence consists of complex ideas communicated with clarity, precision and efficiency” – Tufte 1983
The geom_bar function makes bar charts:
ggplot(data=diamonds) + geom_bar(mapping=aes(x=cut))
Now try this…
ggplot(data=diamonds) + geom_bar(mapping=aes(x=cut, colour=cut))
and this…
ggplot(data=diamonds) + geom_bar(mapping=aes(x=cut, fill=cut))
and finally this…
ggplot(data=diamonds) + geom_bar(mapping=aes(x=cut, fill=clarity), position="dodge")
The geom_histogram and geom_freqpoly functions. With geom_histogram you can make a histogram:
ggplot(data=diamonds) + geom_histogram(mapping=aes(x=carat), binwidth=0.5)
geom_freqpoly allows you to make a frequency polygon
ggplot(data=diamonds) + geom_freqpoly(mapping=aes(x=carat), binwidth=0.5)
The geom_boxplot function. Boxplots are very useful for visualizing data:
ggplot(data=diamonds, mapping=aes(x=cut, y=price)) + geom_boxplot()
ggplot(data=mpg, mapping=aes(x=reorder(class, hwy, FUN=median), y=hwy)) + geom_boxplot() + coord_flip()
ggplot(data=mpg, mapping=aes(x=class, y=hwy)) + geom_boxplot() + coord_flip()
The geom_point & geom_smooth functions:
ggplot(data=diamonds2, mapping=aes(x=x, y=y)) + geom_point()
ggplot(data=mpg) + geom_point(mapping=aes(x=displ, y=hwy)) + facet_wrap(~class, nrow=2)
ggplot(data=mpg) + geom_point(mapping=aes(x=displ, y=hwy)) + facet_grid(drv~cyl)
ggplot(data=mpg) + geom_smooth(mapping=aes(x=displ, y=hwy))
ggplot(data=mpg) + geom_point(mapping=aes(x=displ, y=hwy)) + geom_smooth(mapping=aes(x=displ, y=hwy))
ggplot(data=mpg, aes(displ, hwy)) +
  geom_point(aes(color=class)) +
  geom_smooth(se=FALSE) +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  )